
NileChat-3B (Moroccan & Egyptian Arabic Dialectal LLM)

NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

NileChat is a 3-billion parameter Large Language Model (LLM) adapted for Egyptian and Moroccan communities. It is designed to incorporate their specific language dialects, cultural heritage, and values. The model demonstrates proficiency in both Egyptian and Moroccan dialectal Arabic (using Arabic script and Arabizi), while also maintaining strong performance in Modern Standard Arabic (MSA), French, and English.

This model is the proof-of-concept resulting from the research paper "NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities".

Model Description

NileChat was developed to address the underrepresentation of low-resource languages and local cultures in existing LLMs. Current models often rely on translated English corpora, which aligns them with the source language's culture rather than with that of the target local communities.

The NileChat methodology focuses on creating synthetic and retrieval-based pre-training data tailored to a specific community by considering its:

  • (i) Language: Dialectal nuances, idiomatic expressions, and unique linguistic structures.
  • (ii) Cultural Heritage: Customs, traditions, social norms, historical context, and common knowledge.
  • (iii) Cultural Values: Ethical standpoints, belief systems, and societal priorities.

Together, these are referred to as the Language-Heritage-Values (LHV) dimensions.

The project provides:

  • A novel framework for augmenting pre-training corpora for local communities.
  • New datasets for Egyptian and Moroccan Arabic dialects.
  • The NileChat model itself.

Intended Uses

NileChat is intended to improve LLM accessibility and relevance for Egyptian and Moroccan Arabic-speaking communities. It can be used for tasks requiring:

  • Understanding and generation in Egyptian and Moroccan dialects (Arabic script and Arabizi).
  • Translation between these dialects, MSA, English, and French.
  • Culturally aware interactions and content generation relevant to Egyptian and Moroccan contexts.
  • Applications requiring alignment with local societal values.

How to Use

Example interactions

  • Prompt (Egyptian Arabic): "اديني خمس نصايح ازاي احافظ على وزني" (Give me five tips on how to maintain my weight)
  • Prompt (Moroccan Arabic): "شنو هوما أحسن بلايص ممكن تمشي ليهوم فمراكش؟" (What are the best places to visit in Marrakech?)
The following snippet loads the model and generates a response to the Moroccan prompt above:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "UBC-NLP/NileChat-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Moroccan Arabic: "What are the best places to visit in Marrakech?"
messages = [
    {"role": "user", "content": "شنو هوما أحسن بلايص ممكن تمشي ليهوم فمراكش؟"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Data

NileChat's pre-training and fine-tuning datasets were specifically curated to imbue the model with linguistic and cultural competence in Egyptian (EGY) and Moroccan (MOR) Arabic.

Pre-training Data

A novel data augmentation pipeline was used, combining:

  1. Machine Translation (MT) for Knowledge and Fluency:
    • English educational content (5.5 million texts from Fineweb-edu) was translated into the EGY and MOR dialects using the Command R+ teacher model. The educational domain was chosen for its topical breadth.
  2. Controlled Generation for Cultural Heritage and Values:
    • Diverse texts (stories, personal essays, blog posts, reviews, conversations) were generated in the target language.
    • Components:
      • Local Contextual Information: From local news websites (approx. 1.5M EGY, 800k MOR articles in MSA).
      • Core Cultural Heritage Concepts: Extracted from country-specific Wikipedia portals (25k EGY, 49k MOR articles).
      • Linguistic and Cultural Expressions: Common expressions, proverbs, idioms, TV dialogues (600 utterances), and local terminology (4,000 dialect-to-English word pairs per dialect from Gatitos dictionary).
      • Representative Personas: 1,200 descriptions based on World Values Survey (WVS) data for Egyptian and Moroccan participants.
    • Generated ~300k samples per genre for EGY and ~150k samples per genre for MOR (a prompt-assembly sketch follows this list).
  3. Retrieval for Local Cultural Heritage:
    • Brave Search API queried with 6,500 Moroccan and 4,500 Egyptian cultural concepts across ten categories (food, clothes, landmarks, etc.).
    • Collected 110k articles for EGY and 30k for MOR.
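
The paper's exact generation prompts are not reproduced in this card; the following is a minimal, hypothetical sketch of how the four controlled-generation ingredients above (local context, heritage concepts, expressions, personas) might be assembled into a single request for the teacher model. All pools, field names, and templates are illustrative.

import random

# Hypothetical ingredient pools; the real ones are the news articles, Wikipedia
# concepts, expressions, and WVS-based personas described above.
news_snippets = ["<MSA news article snippet>"]
heritage_concepts = ["couscous", "Jemaa el-Fnaa"]
expressions = ["<local proverb or idiom>"]
personas = ["A 34-year-old teacher from Fes who values family traditions."]
genres = ["story", "personal essay", "blog post", "review", "conversation"]

def build_generation_prompt(dialect: str) -> str:
    """Assemble one controlled-generation request for the teacher model."""
    return (
        f"Write a {random.choice(genres)} in the {dialect} Arabic dialect.\n"
        f"Ground it in this local context: {random.choice(news_snippets)}\n"
        f"Mention this cultural concept: {random.choice(heritage_concepts)}\n"
        f"Use this expression naturally: {random.choice(expressions)}\n"
        f"Write from the perspective of: {random.choice(personas)}"
    )

print(build_generation_prompt("Moroccan"))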

Arabizi Data: 1.5M generated educational/LHV samples for EGY and 0.5M for MOR were converted to Arabizi.
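
The card does not describe the conversion procedure; the sketch below is a minimal character-level transliteration, purely for illustration. Real Arabizi is far less regular: spelling varies by writer and region, though digits such as 3, 7, and 9 conventionally stand in for ع, ح, and ق.

# Minimal character-level Arabic-script -> Arabizi sketch; purely illustrative.
# Digits follow common conventions (3 = ع, 7 = ح, 9 = ق, 2 = ء).
ARABIZI_MAP = {
    "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "7",
    "خ": "kh", "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s",
    "ش": "sh", "ص": "s", "ض": "d", "ط": "t", "ظ": "dh", "ع": "3",
    "غ": "gh", "ف": "f", "ق": "9", "ك": "k", "ل": "l", "م": "m",
    "ن": "n", "ه": "h", "و": "w", "ي": "y", "ء": "2", "ة": "a",
}

def to_arabizi(text: str) -> str:
    return "".join(ARABIZI_MAP.get(ch, ch) for ch in text)

print(to_arabizi("سلام"))  # -> "slam"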

Final Pre-training Mixture: The generated and retrieved data were combined with pre-existing public data for EGY, MOR, MSA, English, French, Math, and Code to mitigate catastrophic forgetting. The resulting dataset comprises 98.57 billion words.

| Type  | Name                | Hugging Face Link |
|-------|---------------------|-------------------|
| Data  | Fineweb-edu-Morocco | Open In HF        |
| Data  | Fineweb-edu-Egypt   | Open In HF        |
| Data  | Arabizi-Egypt       | Open In HF        |
| Data  | Arabizi-Morocco     | Open In HF        |
| Data  | LHV-Egypt           | Open In HF        |
| Data  | LHV-Morocco         | Open In HF        |
| Model | NileChat-3B         | Open In HF        |

Supervised Fine-Tuning (SFT) Data

Due to the scarcity of SFT datasets for EGY and MOR, a comprehensive set was constructed by:

  • Translating the SmolTalk dataset into MOR, EGY, French, and MSA.
  • Generating synthetic dialectal QA pairs from the retrieved local cultural heritage dataset.
  • Incorporating the Darija-SFT-Mixture MOR dataset.
  • Translating the TULU-V2-mix dataset into EGY.
  • Converting understanding/generation tasks from the ORCA and Dolphin benchmark training sets into instruction-response formats (see the record-format sketch below). A portion of this data was converted to Arabizi.
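
As a concrete illustration of the target format, the sketch below packs one task into a chat-style instruction-response record. The function name, field names, and example are illustrative, not the paper's exact templates.

# Sketch of packing a task into the chat-style instruction-response records
# used for SFT; field names and templates are illustrative.
def to_sft_record(instruction: str, input_text: str, answer: str) -> dict:
    user_turn = f"{instruction}\n\n{input_text}" if input_text else instruction
    return {
        "messages": [
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": answer},
        ]
    }

record = to_sft_record(
    instruction="Translate the following sentence into Egyptian Arabic.",
    input_text="What are the best places to visit in Cairo?",
    answer="<Egyptian Arabic translation>",  # illustrative placeholder
)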

Training Procedure

Teacher Model

  • Command R+ (104B) was used as the teacher model for translation and controlled generation, owing to its reasonable text-generation capabilities in the target dialects and its open weights.

Continued Pre-training

  • Base Model: Qwen-2.5-3B was selected for its competitive performance and good tokenizer compression on MSA.
  • The full 3.1B-parameter model was further pre-trained for one epoch on the curated dataset (98.57 billion words).
  • Sequence Length: 4,096.
  • Learning Rate: Linearly decayed from $5 \times 10^{-6}$ to $5 \times 10^{-7}$ (see the scheduler sketch after this list).
  • Weight Decay: 0.1.
  • Gradient Clipping: Norms clipped at 1.0.
  • Compute: Data augmentation took 1,096 hours and continued pre-training 750 hours, each on 4x A100 80GB GPUs.
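
The card does not name the training framework; the snippet below is a plain-PyTorch sketch of just the learning-rate schedule and gradient clipping described above. The step count is illustrative.

import torch
from torch.optim.lr_scheduler import LambdaLR

# Stand-in module; the real run updates the full 3.1B-parameter model.
model = torch.nn.Linear(8, 8)
total_steps = 100_000  # illustrative; the card does not give the step count

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=0.1)

def linear_decay(step: int) -> float:
    # LR multiplier: 1.0 at step 0 (lr = 5e-6), 0.1 at the end (lr = 5e-7).
    return 1.0 - 0.9 * min(step / total_steps, 1.0)

scheduler = LambdaLR(optimizer, lr_lambda=linear_decay)

# Inside the training loop, gradient norms are clipped at 1.0 before stepping:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()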

Supervised Fine-Tuning (SFT)

  • Two separate variants of the pre-trained model were fully fine-tuned: one for MOR and one for EGY, each on its respective dialectal data (Arabic script & Arabizi) plus shared multilingual data (English, MSA, French SmolTalk; ORCA & Dolphin data).
  • Epochs: 2 for each dialect-specific model.
  • Sequence Length: 4,096.
  • Learning Rate: Linearly decayed from $7 \times 10^{-6}$ to $7 \times 10^{-7}$.
  • Model Merging: The two fine-tuned variants were merged using weighted linear averaging to create the final NileChat model (a minimal merging sketch follows).
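
Weighted linear averaging simply interpolates the two checkpoints parameter by parameter. In the sketch below, the merge weight is an assumption (the card does not state the value used) and the checkpoint paths are hypothetical.

import torch
from transformers import AutoModelForCausalLM

# Hypothetical local paths to the two SFT variants described above.
mor = AutoModelForCausalLM.from_pretrained("checkpoints/nilechat-mor-sft")
egy = AutoModelForCausalLM.from_pretrained("checkpoints/nilechat-egy-sft")

w = 0.5  # merge weight; an assumption, not stated in this card
egy_state = egy.state_dict()
merged_state = {
    name: w * tensor + (1.0 - w) * egy_state[name]
    for name, tensor in mor.state_dict().items()
}  # note: integer-typed buffers, if any, may need to be copied as-is

mor.load_state_dict(merged_state)  # reuse one variant's architecture
mor.save_pretrained("checkpoints/nilechat-merged")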

Evaluation

NileChat was evaluated on understanding, cultural knowledge, translation, and value alignment.

Understanding

  • Tasks: MMLU, HellaSwag, Belebele (adapted to EGY and MOR).
  • Results: NileChat demonstrated state-of-the-art performance among similarly sized models, significantly outperforming its baseline Qwen2.5-3B-instruct (e.g., by ~10 points on MMLU, HellaSwag, and Belebele). It also outperformed larger Arabic-focused models such as AceGPT-13B and Jais-13B, and performed on par with ALLaM-7B.

Cultural Knowledge

  • Task: Palm benchmark (public test set for Morocco and Egypt), evaluated using Gemma-3-27b as an LLM judge (a judge-prompt sketch follows this list).
  • Results: NileChat achieved scores of 5.72 (EGY) and 5.86 (MOR), significantly up from Qwen2.5-3B-instruct's 2.86 (EGY) and 2.31 (MOR). It surpassed AceGPT-7B and -13B on Moroccan cultural knowledge.
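
The exact judging prompt is not reproduced in this card; the template below sketches a plausible 0-10 correctness rubric for an LLM-as-judge setup. Everything in it is illustrative.

# Illustrative judge prompt; the exact rubric used with Gemma-3-27b is not
# given in this card.
JUDGE_TEMPLATE = """You are grading an answer to a question about {country} culture.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Rate the model answer's correctness from 0 to 10. Reply with the number only."""

def build_judge_prompt(country: str, question: str, reference: str, answer: str) -> str:
    return JUDGE_TEMPLATE.format(
        country=country, question=question, reference=reference, answer=answer
    )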

Translation

  • Tasks: Flores-200 and an in-house, human-curated dataset of authentic EGY/MOR utterances. Directions: dialect-dialect, dialect-MSA, English-dialect, and French-dialect. Metrics: spBLEU and ChrF++ (a scoring sketch follows this list).
  • Results: NileChat achieved the highest average translation quality (spBLEU: 21.32) among evaluated models, including NLLB-200-3.3B (18.29) and ALLaM-7B (20.60). On the in-house dataset reflecting authentic speech, NileChat significantly outperformed all baselines.
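
Both metrics are available in the sacrebleu library. Assuming a recent sacrebleu release that ships the Flores-200 SentencePiece tokenizer, scoring looks roughly like this:

from sacrebleu.metrics import BLEU, CHRF

hyps = ["<model translation>"]        # system outputs
refs = [["<reference translation>"]]  # one reference stream

# spBLEU = BLEU computed over the Flores-200 SentencePiece tokenization
# (requires a sacrebleu version that includes the "flores200" tokenizer).
spbleu = BLEU(tokenize="flores200").corpus_score(hyps, refs)

# ChrF++ = character n-gram F-score with word bigrams (word_order=2).
chrfpp = CHRF(word_order=2).corpus_score(hyps, refs)

print(spbleu.score, chrfpp.score)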

Value Alignment

  • Task: World Values Survey (WVS) questions adapted into multiple-choice format in the local language, across 13 dimensions. Metric: Social Value Alignment (SVA; an illustrative computation follows this list).
  • Results: NileChat showed substantial improvements over the baseline across most societal-value dimensions for both Morocco and Egypt. The approach of using a teacher LLM for role-playing with local personas successfully steered responses towards culturally aligned positions.
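
The card does not spell out the SVA formula. One plausible reading, sketched below purely as an assumption, is the fraction of questions where the model's chosen option matches the majority response of the corresponding WVS population.

# Illustrative SVA-style score: the share of WVS-derived multiple-choice
# questions where the model's pick matches the population's majority response.
# The paper's exact SVA definition may differ.
def social_value_alignment(model_choices: list[str], majority_choices: list[str]) -> float:
    assert len(model_choices) == len(majority_choices)
    hits = sum(m == s for m, s in zip(model_choices, majority_choices))
    return hits / len(model_choices)

print(social_value_alignment(["A", "C", "B"], ["A", "B", "B"]))  # -> 0.666...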

Performance Evolution

  • Model performance showed a large boost during the first 10B pre-training tokens, stabilizing around 60B tokens for Belebele and translation tasks.

Results for MMLU, HellaSwag, Belebele, and Palm (cultural knowledge) are presented below:

| Model | MMLU EGY | MMLU MOR | HellaSwag EGY | HellaSwag MOR | Belebele EGY | Belebele MOR | Palm EGY | Palm MOR |
|---|---|---|---|---|---|---|---|---|
| **Less than 7B** | | | | | | | | |
| Qwen3-1.7B | 28.53 | 28.53 | 28.44 | 27.47 | 22.89 | 22.89 | 3.61 | 2.12 |
| ar-stablelm-2-chat | 41.56 | 40.36 | 34.79 | 33.45 | 38.89 | 36.11 | 4.20 | 3.62 |
| Atlas-Chat-2B | 42.61 | 44.87 | 29.66 | 34.74 | 50.56 | 55.67 | 3.16 | 3.42 |
| Llama-3.2-3B-Instruct | 40.68 | 37.54 | 29.16 | 28.27 | 45.44 | 35.89 | 3.21 | 2.28 |
| gemma-3-4b-it | 40.79 | 32.70 | 34.21 | 31.35 | 37.33 | 34.22 | **7.61** | 5.42 |
| Qwen3-4B | 28.61 | 28.54 | 30.28 | 29.04 | 22.89 | 22.89 | 4.51 | 2.71 |
| Qwen2.5-3B-Instruct | 43.37 | 44.43 | 31.62 | 29.58 | 51.33 | 41.44 | 2.86 | 2.31 |
| NileChat (3B) | **57.56** | **57.36** | **37.97** | **39.33** | **72.67** | **70.33** | 5.72 | **5.86** |
| **More than 7B** | | | | | | | | |
| AceGPT-7B-chat | 40.29 | 37.57 | 33.27 | 30.47 | 32.67 | 32.00 | 5.58 | 3.93 |
| ALLaM-7B-Instruct | 60.04 | 58.72 | **39.40** | 37.30 | 69.56 | 57.78 | 6.78 | 6.14 |
| Qwen2.5-7B-Instruct | 49.65 | 44.98 | 34.67 | 32.16 | 64.22 | 48.56 | 6.70 | 4.77 |
| Qwen3-8B | 28.53 | 28.53 | 31.76 | 30.32 | 22.89 | 22.89 | 5.88 | 3.96 |
| Atlas-Chat-9B | 55.17 | 58.84 | 33.71 | **44.34** | 70.33 | **74.11** | 5.24 | 4.84 |
| gemma-3-12b-it | **61.17** | **60.00** | 38.59 | 35.66 | **75.78** | 64.89 | **8.76** | **7.09** |
| AceGPT-13B-chat | 45.45 | 40.68 | 35.06 | 32.40 | 38.78 | 36.44 | 6.10 | 4.83 |
| jais-13b-chat | 49.79 | 48.10 | 39.02 | 36.56 | 64.22 | 53.78 | 5.66 | 4.80 |

Zero-shot performance of models on understanding and cultural knowledge evaluations. Metrics are accuracy for MMLU, HellaSwag, and Belebele, and a 0-10 correctness score for Palm. Bold values indicate the highest score among models comparable in size to ours (< 7B) and the highest score in the entire column, including larger models.

Results for translation capabilities using Flores-200 and our in-house dataset are presented below:

| Model | Flores XX→EGY | Flores XX→MOR | Flores EGY→XX | Flores MOR→XX | In-House XX→EGY | In-House XX→MOR | In-House EGY→XX | In-House MOR→XX | Average |
|---|---|---|---|---|---|---|---|---|---|
| **Less than 7B** | | | | | | | | | |
| Qwen3-1.7B | 14.75 | 10.89 | 19.51 | 15.47 | 11.41 | 4.36 | 15.63 | 6.32 | 12.29 |
| ar-stablelm-2-chat | 14.35 | 7.07 | 11.10 | 9.72 | 9.23 | 2.92 | 11.23 | 7.73 | 9.17 |
| Atlas-Chat-2B | 15.20 | 13.40 | 21.39 | 21.11 | 5.36 | 7.83 | 14.52 | 13.54 | 14.05 |
| Llama-3.2-3B-Instruct | 14.25 | 9.15 | 19.28 | 15.54 | 10.67 | 3.16 | 13.61 | 4.87 | 11.32 |
| gemma-3-4b-it | 9.27 | 5.22 | 12.46 | 10.13 | 3.01 | 0.60 | 16.89 | 5.25 | 7.86 |
| Qwen3-4B | 17.93 | 11.64 | 20.03 | 18.90 | 13.09 | 4.44 | 20.72 | 8.52 | 14.41 |
| NLLB-200-3.3B | **23.93** | 15.37 | **25.84** | **26.57** | 16.77 | 7.49 | 18.90 | 11.43 | 18.29 |
| Qwen2.5-3B-Instruct | 15.14 | 11.27 | 20.52 | 17.37 | 9.91 | 4.19 | 19.24 | 7.83 | 13.18 |
| NileChat (3B) | 23.60 | **16.41** | 25.74 | 25.56 | **22.02** | **12.34** | **26.50** | **18.39** | **21.32** |
| **More than 7B** | | | | | | | | | |
| AceGPT-7B-chat | 18.02 | 11.33 | 21.11 | 17.46 | 14.73 | 4.95 | 20.10 | 7.47 | 14.40 |
| ALLaM-7B-Instruct | 23.91 | 15.88 | 24.74 | 23.19 | 19.98 | 9.16 | **29.40** | **18.51** | 20.60 |
| Qwen2.5-7B-Instruct | 14.41 | 10.23 | 19.81 | 18.95 | 10.43 | 4.10 | 20.92 | 8.80 | 13.46 |
| Qwen3-8B | 20.03 | 13.86 | 22.56 | 21.33 | 13.38 | 4.73 | 24.14 | 9.27 | 16.16 |
| Atlas-Chat-9B | 18.20 | **16.89** | 24.92 | 26.29 | 5.36 | 7.68 | 17.35 | 15.23 | 16.49 |
| gemma-3-12b-it | 13.01 | 4.89 | 19.05 | 19.54 | 7.86 | 2.45 | 24.51 | 12.38 | 12.96 |
| AceGPT-13B-chat | 19.48 | 14.02 | 22.81 | 19.84 | 15.54 | 5.56 | 23.51 | 9.52 | 16.29 |
| jais-13b-chat | 8.80 | 4.29 | 15.77 | 17.12 | 10.83 | 4.02 | 19.19 | 12.47 | 11.56 |

Zero-shot translation performance (spBLEU) on the Flores-200 and in-house datasets. XX→EGY and XX→MOR denote averages over EGY and MOR as target languages, respectively; conversely, EGY→XX and MOR→XX denote averages over EGY and MOR as source languages. Bold values highlight the top score among models with fewer than 7 billion parameters and the highest score overall in each column.

Ethical Considerations

  • The work aims to develop inclusive, linguistically and culturally diverse LLMs.
  • Pre-training and instruction-tuning data generation, while using a teacher LLM, was critically informed by ground-truth cultural values survey data (WVS) and local context.
  • Evaluations show reasonable alignment with the cultural heritage and values of the target communities.
  • No explicit safety alignment procedures were conducted. The authors strongly recommend thorough testing and further safety evaluations before any real-world deployment.

Limitations

As a smaller large language model, NileChat-3B shares common limitations with other LLMs, including generating plausible yet incorrect information (hallucinations), sensitivity to prompt phrasing, and inconsistent performance on very long inputs. Although NileChat-3B aims to mitigate these issues, particularly for Arabic tasks, users should exercise critical judgment when evaluating its outputs, especially in high-stakes or fact-dependent situations.

Citation

If you use NileChat or the associated methodology, please cite the original paper:

@misc{mekki2025nilechatlinguisticallydiverseculturally,
      title={NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities}, 
      author={Abdellah El Mekki and Houdaifa Atou and Omer Nacar and Shady Shehata and Muhammad Abdul-Mageed},
      year={2025},
      eprint={2505.18383},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18383}, 
}